Show the code
import pandas as pd
import numpy as np
import json
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)import pandas as pd
import numpy as np
import json
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
## QUESTION|TASK 1
df_clean = df.replace(['', 'n/a', 'NA', -999], np.nan)
example_record = df_clean[df_clean.isna().any(axis=1)].iloc[0].to_dict()
print(json.dumps(example_record, indent=2)){
"airport_code": "ATL",
"airport_name": "Atlanta, GA: Hartsfield-Jackson Atlanta International",
"month": "January",
"year": 2005.0,
"num_of_flights_total": 35048,
"num_of_delays_carrier": "1500+",
"num_of_delays_late_aircraft": NaN,
"num_of_delays_nas": 4598,
"num_of_delays_security": 10,
"num_of_delays_weather": 448,
"num_of_delays_total": 8355,
"minutes_delayed_carrier": 116423.0,
"minutes_delayed_late_aircraft": 104415,
"minutes_delayed_nas": 207467.0,
"minutes_delayed_security": 297,
"minutes_delayed_weather": 36931,
"minutes_delayed_total": 465533
}
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
_I found that the “worst” airport based on the proportion of delayed flights. This metric is superior to total delay counts because it normalizes for airport size; a large hub will naturally have more total delays, but the proportion tells a passenger their actual risk of being delayed. By this metric, San Francisco is the worst airport, with 26% of all flights experiencing delays.
cols_to_fix = ['num_of_flights_total', 'num_of_delays_total', 'minutes_delayed_total']
df_clean[cols_to_fix] = df_clean[cols_to_fix].apply(pd.to_numeric, errors='coerce')
airport_summary = df_clean.groupby('airport_code').agg(
total_flights=('num_of_flights_total', 'sum'),
total_delayed=('num_of_delays_total', 'sum'),
total_minutes_delayed=('minutes_delayed_total', 'sum')
).reset_index()
airport_summary['proportion_delayed'] = airport_summary['total_delayed'] / airport_summary['total_flights']
airport_summary['avg_delay_hours'] = (airport_summary['total_minutes_delayed'] / airport_summary['total_delayed']) / 60
print(airport_summary.to_markdown(index=False))| airport_code | total_flights | total_delayed | total_minutes_delayed | proportion_delayed | avg_delay_hours |
|:---------------|----------------:|----------------:|------------------------:|---------------------:|------------------:|
| ATL | 4430047 | 902443 | 53983926 | 0.20371 | 0.996996 |
| DEN | 2513974 | 468519 | 25173381 | 0.186366 | 0.895495 |
| IAD | 851571 | 168467 | 10283478 | 0.197831 | 1.01736 |
| ORD | 3597588 | 830825 | 56356129 | 0.230939 | 1.13053 |
| SAN | 917862 | 175132 | 8276248 | 0.190804 | 0.78762 |
| SFO | 1630945 | 425604 | 26550493 | 0.260955 | 1.03972 |
| SLC | 1403384 | 205160 | 10123371 | 0.146189 | 0.822396 |
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
The best month to fly is September, closely followed by November. I chose the proportion of delayed flights per month as the metric because it indicates the likelihood of a delay occurring regardless of the total volume of travel in that month. The chart below shows a clear dip in delay proportions during the fall months.
df_month = df_clean.dropna(subset=['month'])
month_summary = df_month.groupby('month').agg(
total_flights=('num_of_flights_total', 'sum'),
total_delayed=('num_of_delays_total', 'sum')
).reset_index()
month_summary['proportion_delayed'] = month_summary['total_delayed'] / month_summary['total_flights']
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']
plot = (ggplot(month_summary, aes(x='month', y='proportion_delayed')) +
geom_bar(stat='identity', fill='#5e81ac') +
scale_x_discrete(limits=month_order) +
labs(title='Proportion of Delayed Flights by Month', x='Month', y='Proportion Delayed') +
theme(axis_text_x=element_text(angle=45, hjust=1))
)
plot.show()According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
I replaced missing values in the late_aircraft column with the column mean. I then applied the specific business rules to create a new weather_delayed_flights column. The first 5 rows of the dataset, including this new calculation, are displayed below.
mean_late = df_clean['num_of_delays_late_aircraft'].mean()
df_clean['num_of_delays_late_aircraft'] = df_clean['num_of_delays_late_aircraft'].fillna(mean_late)
def calculate_weather_delay(row):
severe = row['num_of_delays_weather']
late = row['num_of_delays_late_aircraft'] * 0.30
if row['month'] in ['April', 'May', 'June', 'July', 'August']:
nas_part = row['num_of_delays_nas'] * 0.40
else:
nas_part = row['num_of_delays_nas'] * 0.65
return severe + late + nas_part
df_clean['weather_delayed_flights'] = df_clean.apply(calculate_weather_delay, axis=1)
print(df_clean[['airport_code', 'month', 'weather_delayed_flights']].head().to_markdown(index=False))| airport_code | month | weather_delayed_flights |
|:---------------|:--------|--------------------------:|
| ATL | January | 3769.43 |
| DEN | January | 1119.15 |
| IAD | January | 960.15 |
| ORD | January | 4502.25 |
| SAN | January | 674.7 |
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
The graph illustrates that San Francisco is significantly more impacted by weather related delays than other airports, with nearly 7% of all flights delayed due to weather conditions. In contrast, airports like San Diego are much less affected by weather.
weather_summary = df_clean.groupby('airport_code').agg(
total_flights=('num_of_flights_total', 'sum'),
total_weather_delays=('weather_delayed_flights', 'sum')
).reset_index()
weather_summary['proportion_weather_delay'] = weather_summary['total_weather_delays'] / weather_summary['total_flights']
plot_weather = (ggplot(weather_summary, aes(x='airport_code', y='proportion_weather_delay', fill='airport_code')) +
geom_bar(stat='identity') +
labs(title='Proportion of Flights Delayed by Weather per Airport',
x='Airport', y='Proportion') +
theme(legend_position='none')
)
plot_weather.show()Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
Comparing Weather, Carrier, and Security delays reveals that Carrier delays are consistently the most frequent cause of delay among these three categories, generally accounting for a higher proportion of total flights than Weather or Security. Security delays are negligible across all airports.
comparison_summary = df_clean.groupby('airport_code').agg(
total_flights=('num_of_flights_total', 'sum'),
sum_weather=('weather_delayed_flights', 'sum'),
sum_carrier=('num_of_delays_carrier', 'sum'),
sum_security=('num_of_delays_security', 'sum')
).reset_index()
comparison_summary['sum_carrier'] = pd.to_numeric(comparison_summary['sum_carrier'], errors='coerce')
comparison_summary['sum_security'] = pd.to_numeric(comparison_summary['sum_security'], errors='coerce')
comparison_summary['Weather'] = comparison_summary['sum_weather'] / comparison_summary['total_flights']
comparison_summary['Carrier'] = comparison_summary['sum_carrier'] / comparison_summary['total_flights']
comparison_summary['Security'] = comparison_summary['sum_security'] / comparison_summary['total_flights']
comparison_melted = comparison_summary.melt(
id_vars=['airport_code'],
value_vars=['Weather', 'Carrier', 'Security'],
var_name='Delay_Type',
value_name='Proportion'
)
plot_compare = (ggplot(comparison_melted, aes(x='airport_code', y='Proportion', fill='Delay_Type')) +
geom_bar(stat='identity', position='dodge') +
labs(title='Comparison of Delay Types by Airport', x='Airport', y='Proportion of Flights')
)
plot_compare.show()